Skip to content

Quarto notebooks#78

Merged
KevinMLanderos merged 4 commits into
mainfrom
quarto-notebooks
Apr 22, 2026
Merged

Quarto notebooks#78
KevinMLanderos merged 4 commits into
mainfrom
quarto-notebooks

Conversation

@KevinMLanderos
Copy link
Copy Markdown
Collaborator

Adapt notebooks to new "patient" workflow.
Phenotypic plasticity still to be fixed when input data is single-cell

@KevinMLanderos KevinMLanderos merged commit e694ce2 into main Apr 22, 2026
1 check passed
@dimalvovs dimalvovs requested a review from Copilot April 22, 2026 18:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds/updates a suite of Quarto notebook templates to support the new “patient” workflow, expanding the reporting pipeline with QC, overlap/dynamics, sharing/publicity, and phenotype-focused analyses (bulk + single-cell).

Changes:

  • Introduces new/updated Quarto templates for QC, overlap/clonal dynamics, sharing/publicity, GLIPH, and phenotype analyses.
  • Adds “Details” report entry points split into template_details_part1.qmd and template_details_part2.qmd that include the new sections.
  • Updates .gitignore to allowlist additional notebook templates under notebooks/.

Reviewed changes

Copilot reviewed 9 out of 12 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
notebooks/template_sharing.qmd New sharing/publicity analysis plots (patient-level sharing + Pgen correlation).
notebooks/template_qc.qmd New/expanded QC report with many interactive plots (diversity, overlap, rarefaction, PCA, etc.).
notebooks/template_pheno_sc_details.qmd Single-cell phenotype similarity (Morisita) “details” view.
notebooks/template_pheno_sc.qmd Single-cell phenotype composition + phenotype plasticity (UpSet) analyses.
notebooks/template_pheno_bulk.qmd Bulk phenotype inference via TCRpheno and weighted phenotype composition plots.
notebooks/template_overlap.qmd Longitudinal clonal dynamics: Fisher tests, heatmaps, trajectories, and summary tables.
notebooks/template_gliph.qmd GLIPH2 visualization templates (network, motifs, logos).
notebooks/template_details_part1.qmd “Details” report entry point (part 1) including setup + sample notebook include.
notebooks/template_details_part2.qmd “Details” report entry point (part 2) including overlap + sharing includes.
.gitignore Adds allowlist entries so new templates in notebooks/ can be tracked despite notebooks/* ignore.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

For this section to be rightfully executed, you must have had provided cells and their associated phenotypes.
:::

Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression caracterization, which allows for cell phenotype assignment.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: “caracterization” should be “characterization”.

Suggested change
Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression caracterization, which allows for cell phenotype assignment.
Single-cell technologies (e.g. 10X) allow the TCR sequence capture together with the RNA expression characterization, which allows for cell phenotype assignment.

Copilot uses AI. Check for mistakes.
Comment thread notebooks/template_qc.qmd
Heatmaps display the pairwise similarity between TCR repertoires using the Jaccard (overlap) index. Hierarchical clustering groups samples based on their shared clonal content, facilitating the identification of biological replicates or potential cross-contamination events. The color intensity reflects the degree of similarity, where darker red indicates a higher proportion of shared clonotypes between sample pairs.

::: {.callout-tip title="Note"}
When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq whas done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA).
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: “whas” should be “was”.

Suggested change
When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq whas done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA).
When calculating Jaccard for this specific purpose (contamination), we are calculating it on the nucleotide sequence, not the amino acid sequence, **whenever the TCRseq was done using DNA as input**. Convergent recombination (different DNA making the same Amino Acid) is common in high-depth samples and will give you a false positive "contamination" signal if you stick to Amino Acids. Contamination should be measured at the physical level (DNA).

Copilot uses AI. Check for mistakes.

- **Full Timeline:** This table displays the clone's frequency across all sequenced timepoints.
- **Interpretation:** This view allows you to categorize the clone's overall behavior. For example, you can distinguish between a sustained expansion (frequency rises and stays high) versus a transient expansion (frequency rises and then drops back to zero).
- Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.w the limit of detection for that sample, rather than true biological absence.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s a duplicated fragment/typo in this sentence (“absence.w ... absence.”) which will render oddly. Remove the stray “w” and the duplicated clause so the note reads cleanly.

Suggested change
- Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.w the limit of detection for that sample, rather than true biological absence.
- Note: A frequency of 0.0 usually indicates the clone fell below the limit of detection for that sample, rather than true biological absence.

Copilot uses AI. Check for mistakes.
Comment on lines +569 to +582
for pat in unique_patients:
print(f"## {pat}\n")

# Filter data for the current patient
pat_df = plot_df[plot_df[subject_col] == pat]

# Group by clone (cdr3) and aggregate the unique phenotypes it appears in
# Note: adjust 'cdr3' if your sequence column is named differently
# (e.g., 'junction_aa', 'clone_id')
if 'cdr3' not in pat_df.columns:
print("Error: 'cdr3' column not found in data. Please check your clonotype column name.")
continue

clone_phenotypes = pat_df.groupby('cdr3')['phenotype'].apply(lambda x: list(set(x)))
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This UpSet plot helper checks for a cdr3 column, but earlier in this notebook the clonotype column used is CDR3b. As written, this will print an error for every patient and skip plot generation. Use the correct column name (or make it configurable/detect CDR3b/junction_aa etc.) so the plot actually renders.

Suggested change
for pat in unique_patients:
print(f"## {pat}\n")
# Filter data for the current patient
pat_df = plot_df[plot_df[subject_col] == pat]
# Group by clone (cdr3) and aggregate the unique phenotypes it appears in
# Note: adjust 'cdr3' if your sequence column is named differently
# (e.g., 'junction_aa', 'clone_id')
if 'cdr3' not in pat_df.columns:
print("Error: 'cdr3' column not found in data. Please check your clonotype column name.")
continue
clone_phenotypes = pat_df.groupby('cdr3')['phenotype'].apply(lambda x: list(set(x)))
clonotype_candidates = ['CDR3b', 'cdr3', 'junction_aa', 'clone_id']
clonotype_col = next((col for col in clonotype_candidates if col in plot_df.columns), None)
if clonotype_col is None:
print(
"Error: no clonotype column found in data. Expected one of: "
f"{', '.join(clonotype_candidates)}."
)
print(":::\n")
return
for pat in unique_patients:
print(f"## {pat}\n")
# Filter data for the current patient
pat_df = plot_df[plot_df[subject_col] == pat]
# Group by clone and aggregate the unique phenotypes it appears in
if clonotype_col not in pat_df.columns:
print(
f"Error: '{clonotype_col}' column not found in data for patient {pat}. "
"Please check your clonotype column name."
)
continue
clone_phenotypes = pat_df.groupby(clonotype_col)['phenotype'].apply(lambda x: list(set(x)))

Copilot uses AI. Check for mistakes.
Comment thread notebooks/template_qc.qmd
Comment on lines +2013 to +2042
Simulates rarefaction by expanding counts and shuffling.
"""
# Flat array where each element is a clone ID, repeated by its count
# Use integer IDs for memory efficiency instead of strings
counts = sample_df['counts'].values
n_reads = counts.sum()

if n_reads == 0:
return pd.DataFrame()

# Generate clone IDs (0 to N-1)
ids = np.arange(len(counts))

# e.g., if clone 0 has 2 counts -> [0, 0]
# WARNING: High memory usage for very deep sequencing
flat_repertoire = np.repeat(ids, counts)

# Shuffle to simulate random sampling
np.random.shuffle(flat_repertoire)

# Define depths to check (from 0 to total reads)
depths = np.linspace(0, n_reads, num=steps, dtype=int)
depths = np.unique(depths)[1:] # Remove 0 and duplicates

results = []
for d in depths:
subsample = flat_repertoire[:d]
n_unique = len(np.unique(subsample))
results.append({'sample': sample, 'depth': d, 'richness': n_unique})

Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rarefaction implementation materializes the full read-level repertoire with np.repeat(ids, counts), which can allocate O(total_reads) memory per sample (often millions+) and can easily OOM / stall notebooks on real cohorts. Consider an algorithm that samples counts without expanding to read-level (e.g., multinomial/hypergeometric-based subsampling per depth, or iterative reservoir-style sampling) and/or cap max depth / number of samples.

Suggested change
Simulates rarefaction by expanding counts and shuffling.
"""
# Flat array where each element is a clone ID, repeated by its count
# Use integer IDs for memory efficiency instead of strings
counts = sample_df['counts'].values
n_reads = counts.sum()
if n_reads == 0:
return pd.DataFrame()
# Generate clone IDs (0 to N-1)
ids = np.arange(len(counts))
# e.g., if clone 0 has 2 counts -> [0, 0]
# WARNING: High memory usage for very deep sequencing
flat_repertoire = np.repeat(ids, counts)
# Shuffle to simulate random sampling
np.random.shuffle(flat_repertoire)
# Define depths to check (from 0 to total reads)
depths = np.linspace(0, n_reads, num=steps, dtype=int)
depths = np.unique(depths)[1:] # Remove 0 and duplicates
results = []
for d in depths:
subsample = flat_repertoire[:d]
n_unique = len(np.unique(subsample))
results.append({'sample': sample, 'depth': d, 'richness': n_unique})
Simulates rarefaction by sampling directly from clone counts without
materializing a read-level array.
"""
counts = sample_df['counts'].to_numpy(dtype=np.int64, copy=False)
n_reads = int(counts.sum())
if n_reads == 0:
return pd.DataFrame()
# Define depths to check (from 0 to total reads)
depths = np.linspace(0, n_reads, num=steps, dtype=int)
depths = np.unique(depths)[1:] # Remove 0 and duplicates
# Track remaining reads per clone and whether each clone has been seen
remaining = counts.copy()
seen = np.zeros(len(counts), dtype=bool)
richness = 0
prev_depth = 0
results = []
for d in depths:
draw_size = int(d - prev_depth)
if draw_size > 0:
total_remaining = int(remaining.sum())
draws_left = draw_size
for i in range(len(remaining)):
if draws_left == 0 or total_remaining == 0:
break
clone_remaining = int(remaining[i])
if clone_remaining == 0:
continue
if i == len(remaining) - 1:
drawn = draws_left
else:
drawn = int(
np.random.hypergeometric(
clone_remaining,
total_remaining - clone_remaining,
draws_left
)
)
if drawn > 0:
remaining[i] -= drawn
if not seen[i]:
seen[i] = True
richness += 1
draws_left -= drawn
total_remaining -= clone_remaining
results.append({'sample': sample, 'depth': int(d), 'richness': richness})
prev_depth = int(d)

Copilot uses AI. Check for mistakes.
**Figure 3: Longitudinal tracking of top contracted TCR clonotypes.** The temporal dynamics of the 15 most significantly downregulated clonotypes, ranked by lowest FDR-adjusted p-value ($q$-value). Blue lines trace the $log_{10}$ frequency of individual clones that exhibited significant contraction ($FC < 0.5, q < 0.05$) across the sampled timepoints.


### Clonal Expansion/Contraction Paiwise Comparisons
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in header: “Paiwise” should be “Pairwise”.

Suggested change
### Clonal Expansion/Contraction Paiwise Comparisons
### Clonal Expansion/Contraction Pairwise Comparisons

Copilot uses AI. Check for mistakes.
Comment on lines +735 to +741
if 'timepoint_order' in locals():
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
else:
ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)

Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timepoint ordering here relies on a timepoint_order mapping that is never defined in this notebook, so it always falls back to sorted(timepoints). That can mis-order timepoints like T1, T2, T10 and make longitudinal plots misleading. Build the ordering map from timepoint_order_col (e.g., from meta or clonotypes_df) and use it consistently.

Suggested change
if 'timepoint_order' in locals():
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)
else:
ordered_timepoints = sorted(cdr3_df[timepoint_col].unique())
timepoint_order_fallback = {t: i for i, t in enumerate(ordered_timepoints)}
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order_fallback)
timepoint_order = None
timepoint_order_col_name = locals().get('timepoint_order_col')
if timepoint_order_col_name is not None:
order_source = None
if 'meta' in locals() and timepoint_col in meta.columns and timepoint_order_col_name in meta.columns:
order_source = meta[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col])
elif timepoint_col in clonotypes_df.columns and timepoint_order_col_name in clonotypes_df.columns:
order_source = clonotypes_df[[timepoint_col, timepoint_order_col_name]].dropna().drop_duplicates(subset=[timepoint_col])
if order_source is not None and not order_source.empty:
order_source = order_source.sort_values(timepoint_order_col_name)
ordered_timepoints = order_source[timepoint_col].tolist()
timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)}
if timepoint_order is None:
ordered_timepoints = sorted(cdr3_df[timepoint_col].dropna().unique())
timepoint_order = {t: i for i, t in enumerate(ordered_timepoints)}
cdr3_df['timepoint_rank'] = cdr3_df[timepoint_col].map(timepoint_order)

Copilot uses AI. Check for mistakes.
import json
from IPython.display import HTML, display
import warnings
import matplotlib.pyplot as plt
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matplotlib.pyplot is imported twice in this block. Drop the duplicate import to reduce noise and keep imports consistent.

Suggested change
import matplotlib.pyplot as plt

Copilot uses AI. Check for mistakes.

This plot provides a powerful, two-dimensional view of your T-cell repertoire, helping you move beyond simple clone lists to understand the functional landscape of the immune response. The central question it helps answer is: **Are the T-cell clones found across many different samples (widely shared) also the ones that are most active and expanded?**

This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in narrative text: “axplanation” should be “explanation”.

Suggested change
This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed axplanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.
This plot is generated by first calculating the frequency of each unique TCR clone within every sample from the raw data. For each clone, two key metrics are then determined: the total number of samples it appears in (its sharing level) and its single highest frequency across all of those samples (its maximum expansion). **Clones are then categorized as 'Highly Expanded' (>1%), 'Expanded' (0.1%-1%), or 'Non-expanded' (<0.1%) based on this maximum frequency value (see sample Notebook for a detailed explanation)**. The final stacked bar plot visualizes the count of unique TCRs for each sharing level on the x-axis, with the colored segments revealing the proportion of clones that fall into each expansion category.

Copilot uses AI. Check for mistakes.

**<u>Downstream biological analyses</u>**
- **Single-cell**:
It is possile to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: “possile” should be “possible”.

Suggested change
It is possile to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.
It is possible to link TCRs to phenotypic states (exhaustion, activation, tissue localization), which allows the study of clonotype heterogeneity.

Copilot uses AI. Check for mistakes.
This was referenced Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants